Random Forests and Adaptive Nearest Neighbors

Authors

  • Yi Lin
  • Yongho Jeon
Abstract

In this paper we study random forests through their connection with a new framework of adaptive nearest neighbor methods. We first introduce the concept of potential nearest neighbors (k-PNNs) and show that random forests can be viewed as adaptively weighted k-PNN methods. Various aspects of random forests are then studied from this perspective. We investigate the effect of terminal node sizes and splitting schemes on the performance of random forests. It has been commonly believed that random forests work best with the largest possible trees. We derive a lower bound on the rate of the mean squared error of regression random forests with non-adaptive splitting schemes and show that, asymptotically, growing the largest possible trees in such random forests is not optimal. However, for high-dimensional problems it may take a very large sample size before this asymptotic result takes effect. We illustrate with simulations the effect of terminal node sizes on the prediction accuracy of random forests with other splitting schemes. In general, it is advantageous to tune the terminal node size for the best performance of random forests. We further show that random forests with adaptive splitting schemes assign weights to k-PNNs in a desirable way: for estimation at a given target point, these random forests assign voting weights to the k-PNNs of the target point according to the local importance of the different input variables. We propose a new, simple splitting scheme that achieves the desired adaptivity in a straightforward fashion. This simple scheme can be combined with existing algorithms; the resulting algorithm is computationally faster and gives comparable results. Other possible aspects of random forests, such as using linear combinations of variables in splitting, are also discussed. Simulations and real datasets are used to illustrate the results.
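To make the weighted-neighbor view described in the abstract concrete, the following is a minimal sketch, not code from the paper, that uses scikit-learn's RandomForestRegressor (an assumed tool; the synthetic data, parameter choices, and variable names below are illustrative) to recover, for one target point, the weights a regression forest implicitly places on the training observations: each tree spreads equal weight over the training points that share the target's terminal node, and averaging these weights over trees reproduces the forest prediction. The min_samples_leaf parameter plays the role of the terminal node size discussed above; bootstrap=False is set only so that the weighted average matches the prediction exactly.

```python
# Minimal sketch (assumptions: scikit-learn API, synthetic data) of a regression
# random forest viewed as a weighted nearest-neighbor estimator.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 5))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(500)

# min_samples_leaf is the terminal node size; bootstrap=False makes the
# weighted-neighbor reconstruction below match forest.predict exactly, and
# max_features="sqrt" keeps the trees distinct when bagging is turned off.
forest = RandomForestRegressor(
    n_estimators=200, min_samples_leaf=5, bootstrap=False,
    max_features="sqrt", random_state=0,
).fit(X, y)

x0 = rng.uniform(size=(1, 5))
train_leaves = forest.apply(X)    # (n_samples, n_trees) terminal node ids
target_leaves = forest.apply(x0)  # (1, n_trees)

# Each tree gives equal weight to the training points sharing x0's leaf;
# the forest weight is the average of these per-tree weights.
weights = np.zeros(len(X))
for t in range(forest.n_estimators):
    same_leaf = train_leaves[:, t] == target_leaves[0, t]
    weights[same_leaf] += 1.0 / same_leaf.sum()
weights /= forest.n_estimators

print("forest prediction:        ", forest.predict(x0)[0])
print("weighted-neighbor average:", weights @ y)  # agrees up to rounding
```

In this picture, increasing min_samples_leaf spreads the weight over more, but more heavily smoothed, potential nearest neighbors, which is the trade-off behind the abstract's point that tuning the terminal node size, rather than always growing the largest trees, can improve accuracy.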

Similar resources

The Performance of small samples in quantifying structure central Zagros forests utilizing the indexes based on the nearest neighbors

Today, forest structure has become one of the main ecological debates in forest science. Determining forest structure characteristics is necessary to investigate stand change processes and to plan silvicultural interventions and revival operations. To investigate the structure of part of the Ghale-Gol forests in Khorramabad, a set of indices such as Cla...

Adaptively Discovering Meaningful Patterns in High-Dimensional Nearest Neighbor Search

To query high-dimensional databases, similarity search (or k-nearest-neighbor search) is the most extensively used method. However, since each attribute of a high-dimensional data record contains only a very small amount of information, the distance between two high-dimensional records may not always correctly reflect their similarity. So, a multi-dimensional query may have a k-nearest-neighbor set whi...

Estimation of Density using Plotless Density Estimator Criteria in Arasbaran Forest

Sampling methods have a theoretical basis and should be operational in different forests; therefore, selecting an appropriate sampling method is important for accurate estimation of forest characteristics. The purpose of this study was to estimate stand density (number per hectare) in the Arasbaran forest using a variety of plotless density estimators of the nearest-neighbors sampling me...

An adaptive classification method for multimedia retrieval

Relevance feedback can effectively improve the performance of content-based multimedia retrieval systems. To be effective, a relevance feedback approach must be able to efficiently capture the user’s query concept from a very limited number of training samples. To address this issue, we propose a novel adaptive classification method using random forests, which is a machine learning algorithm wi...

Imputation of Missing Values for Unsupervised Data Using the Proximity in Random Forests

This paper presents a new procedure that imputes missing values by random forests for unsupervised data. We found that it works well compared with k-nearest-neighbor (kNN) imputation and a rough imputation that replaces missing values with the variable medians. Moreover, this procedure can be extended to semisupervised data sets. The rate of correct classification is higher than that of other conventional method...
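As a rough illustration of the generic proximity-based idea described in this snippet (a sketch under stated assumptions, not the authors' algorithm), the following code uses scikit-learn's RandomTreesEmbedding as an unsupervised forest on synthetic data: it starts from a rough median fill, defines the proximity of two rows as the fraction of trees in which they share a terminal node, and then re-imputes each missing cell as a proximity-weighted average of the values observed in that column.

```python
# Minimal sketch (assumptions: scikit-learn's RandomTreesEmbedding, synthetic
# data) of proximity-based imputation with an unsupervised forest.
import numpy as np
from sklearn.ensemble import RandomTreesEmbedding

rng = np.random.default_rng(1)
latent = rng.normal(size=(300, 1))
X = latent + 0.5 * rng.normal(size=(300, 4))   # correlated columns
mask = rng.uniform(size=X.shape) < 0.1         # 10% of cells missing at random
X_miss = np.where(mask, np.nan, X)

# Step 1: rough imputation with column medians (the baseline the snippet mentions).
medians = np.nanmedian(X_miss, axis=0)
X_imp = np.where(mask, medians, X_miss)

# Step 2: proximity = fraction of trees in which two rows share a terminal node,
# computed from the one-hot leaf encoding of a fully random (unsupervised) forest.
embed = RandomTreesEmbedding(n_estimators=100, random_state=0).fit(X_imp)
leaves = embed.transform(X_imp)                # sparse (n_samples, total_leaves)
proximity = (leaves @ leaves.T).toarray() / embed.n_estimators
np.fill_diagonal(proximity, 0.0)

# Step 3: re-impute each missing cell as a proximity-weighted average of the
# observed values in its column; in practice steps 2-3 are iterated a few times.
for i, j in zip(*np.where(mask)):
    observed = ~mask[:, j]
    w = proximity[i, observed]
    if w.sum() > 0:
        X_imp[i, j] = w @ X_miss[observed, j] / w.sum()

print("median-fill MAE:   ", np.mean(np.abs(medians[np.where(mask)[1]] - X[mask])))
print("proximity-fill MAE:", np.mean(np.abs(X_imp[mask] - X[mask])))
```

The printed errors only compare the median fill with the proximity-weighted fill on this synthetic example; the kNN comparison and the semisupervised extension mentioned in the snippet are not reproduced here.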

Journal title:

Volume    Issue

Pages    -

Publication date: 2002